Abstract



Data stored in a data warehouse are inherently multidimensional, but most data-pruning techniques 
(such as iceberg and top-k queries) are unidimensional. However, analysts need to issue multidimensional
queries. For example, an analyst may need to select not just the most profitable stores or — separately — 
the most profitable products, but simultaneous sets of stores and products fulfilling some profitability 
constraints. To fill this need, we propose a new operator, the diamond dice. Because of the interaction 
between dimensions, the computation of diamonds is challenging. 

We present the first diamond-dicing experiments on large data sets. 

Experiments show that we can compute diamond cubes over fact tables containing 100 million facts 
in less than 35 minutes using a standard PC. 



terms Theory, Algorithms, Experimentation 
keywords Diamond cube, data warehouses, information retrieval, OLAP 

1 Introduction 

In signal and image processing, software subsamples data [29] for visualization, compression, or analysis 
purposes: commonly, images are cropped to focus the attention on a segment. In databases, researchers have 
proposed similar subsampling techniques [3, 14], including iceberg queries [13,27,33] and top-k queries [21, 
22]. Formally, subsampling is the selection of a subset of the data, often with desirable properties such as 
representativity, conciseness, or homogeneity. Of the subsampling techniques applicable to OLAP, only the 
dice operator focuses on reducing the number of attribute values without aggregation whilst retaining the 
original number of dimensions. 

Such reduced representations are sometimes of critical importance to get good online performance in 
Business Intelligence (BI) applications [2, 13]. Even when performance is not an issue, browsing and visu- 
alizing the data frequently benefit from reduced views [4]. 

Often, business analysts are interested in distinguishing elements that are most crucial to their business,
such as the k products jointly responsible for 50% of all sales, from the long tail [1] (the lesser elements).
The computation of icebergs, top-k elements, or heavy-hitters has received much attention [7-9]. We wish
to generalize this type of query so that interactions between dimensions are allowed. For example, a
business analyst might want to compute a small set of stores and business hours jointly responsible for over
80% of the sales. In this new setting, the head and tails of the distributions must be described using a
multidimensional language; computationally, the queries become significantly more difficult. Hence, analysts
will often process dimensions one at a time: perhaps they would focus first on the most profitable business
hours, and then aggregate sales per store, or perhaps they would find the most profitable stores and aggregate
sales per hour. We propose a general model, of which the unidimensional analysis is a special case, that has
acceptable computational costs and a theoretical foundation. In the two-dimensional case, our proposal is a
generalization of Iterative Pruning [18], a graph-trawling approach used to analyze social networks. It
also generalizes iceberg queries [13,27,33].

Table 1: Sales (in million dollars) with a 4,10 sum-diamond marked with *: stores need to have sales above
$10 million whereas product lines need sales above $4 million

                 Chicago   Montreal   Miami    Paris    Berlin
TV                 3.4       0.9       0.1      0.9      2.0
Camcorder          0.1       1.4*      3.1*     2.3*     2.1
Phone              0.2       8.4*      2.1*     4.5*     0.1
Camera             0.4       2.7*      6.3*     4.6*     3.5
Game console       3.2       0.3       0.3      2.1      1.5
DVD Player         0.2       0.5       0.5      2.2      2.3

To illustrate our proposal in the BI context, consider the following example. Table 1 represents the sales
of different items in different locations. Typical iceberg queries might be requests for stores having sales of 
at least 10 million dollars or product lines with sales of at least 4 million dollars. However, what if the analyst 
wants to apply both thresholds simultaneously? He might contemplate closing both some stores and some 
product lines. In our example, applying the constraint on stores would close Chicago, whereas applying the 
constraint on product lines would not terminate any product line. However, once the shop in Chicago is 
closed, we see that the product line TV must be terminated which causes the closure of the Berlin store and 
the termination of two new product lines (Game console and DVD player). 
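
The cascade described above can be reproduced with a few lines of Python. This is only an illustrative sketch: the names SALES, STORES and sum_diamond are hypothetical, the figures are copied from Table 1, and the loop simply re-applies both thresholds until nothing changes. It is not the file-based algorithm of Section 5.

```python
# Sales from Table 1 (million dollars); each list follows the order of STORES.
STORES = ["Chicago", "Montreal", "Miami", "Paris", "Berlin"]
SALES = {
    "TV":           [3.4, 0.9, 0.1, 0.9, 2.0],
    "Camcorder":    [0.1, 1.4, 3.1, 2.3, 2.1],
    "Phone":        [0.2, 8.4, 2.1, 4.5, 0.1],
    "Camera":       [0.4, 2.7, 6.3, 4.6, 3.5],
    "Game console": [3.2, 0.3, 0.3, 2.1, 1.5],
    "DVD Player":   [0.2, 0.5, 0.5, 2.2, 2.3],
}


def sum_diamond(sales, stores, k_product, k_store):
    """Drop products and stores whose remaining sales fall below their
    threshold, repeating until no further value can be dropped."""
    column = {s: i for i, s in enumerate(stores)}
    keep_products = set(sales)
    keep_stores = set(stores)
    changed = True
    while changed:
        changed = False
        for p in sorted(keep_products):
            if sum(sales[p][column[s]] for s in keep_stores) < k_product:
                keep_products.discard(p)
                changed = True
        for s in sorted(keep_stores):
            if sum(sales[p][column[s]] for p in keep_products) < k_store:
                keep_stores.discard(s)
                changed = True
    return keep_products, keep_stores


products, stores = sum_diamond(SALES, STORES, k_product=4, k_store=10)
print(sorted(products), sorted(stores))
```

Applying either threshold alone removes only Chicago; the other removals appear only because the two constraints interact.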

This multidimensional pruning query selects a subset of attribute values from each dimension that are 
simultaneously important. The operation is a diamond dice [32] and produces a diamond, as formally defined
in Section 3.

Other approaches that seek important attribute values, e.g. the Skyline operator [6,23], Dominant
Relationship Analysis [20], and top-k dominating queries [35], require dimension attribute values to be ordered,
e.g. distance between a hotel and a conference venue, so that data points can be ordered. Our approach
requires no such ordering.

2 Notation 

Notation used in this paper is tabulated below.

Symbol                    Meaning
$C$                       a data cube
$\sigma$                  aggregator COUNT or SUM
$C_{\mathrm{dim},j}$      $\sigma$(slice $j$ of dimension dim in cube $C$)
$|C|$                     the number of allocated cells in cube $C$
$A$, $B$                  cubes
$D_i$                     $i$th dimension of a data cube
$n_i$                     number of attribute values in dimension $D_i$
$k$                       number of carats
$k_i$                     number of carats of order 1 for $D_i$
$d$                       number of dimensions
$p$                       max. number of attribute values per dim
$p_i$                     max. number of attribute values for $D_i$
$\kappa(C)$               maximum carats in $C$
COUNT-$\kappa(C)$         maximum carats in $C$, $\sigma$ is COUNT



3 Properties of Diamond Cubes 

Given a database relation, a dimension $D$ is the set of values associated with a single attribute. A cube $C$ is
the set of dimensions together with a map from some tuples in $D_1 \times \cdots \times D_d$ to real-valued measure values.
Without losing generality, we shall assume that $n_1 \le n_2 \le \ldots \le n_d$, where $n_i$ is the number of distinct
attribute values in dimension $i$.

A slice of order $\delta$ is the set of cells we obtain when we fix a single attribute value in each of $\delta$ different
dimensions. For example, a slice of order 0 is the entire cube, a slice of order 1 is the more traditional
definition of a slice, and so on. For a $d$-dimensional cube, a slice of order $d$ is a single cell. An aggregator is
a function, $\sigma$, from sets of values to the real numbers.

Definition 1. Let $\sigma$ be an aggregator such as SUM or COUNT, and let $k$ be some real-valued number. A cube
has $k$ carats over dimensions $i_1, \ldots, i_\delta$ if for every slice $x$ of order $\delta$ along dimensions $i_1, \ldots, i_\delta$, we have
$\sigma(x) \ge k$.

We can recover iceberg cubes by seeking cubes having carats of order $d$ where $\sigma(x)$ returns the measure
corresponding to cell $x$. The predicate $\sigma(x) \ge k$ could be generalized to include $\sigma(x) \le k$ and other
constraints.

We say that an aggregator $\sigma$ is monotonically increasing if $S' \subset S$ implies $\sigma(S') \le \sigma(S)$. Similarly, $\sigma$
is monotonically decreasing if $S' \subset S$ implies $\sigma(S') \ge \sigma(S)$. Monotonically increasing operators include
COUNT, MAX and SUM (over non-negative measures). Monotonically decreasing operators include MIN and
SUM (over non-positive measures).

We say a cube $C'$ is restricted from cube $C$ if

• they have the same number of dimensions

• dimension $i$ of $C'$ is a subset of dimension $i$ of $C$

• if in $C'$, $(v_1, v_2, \ldots, v_d) \mapsto m$, then in $C$, $(v_1, v_2, \ldots, v_d) \mapsto m$

Definition 2. Let $A$ and $B$ be two cubes with the same dimensions and measures restricted from a single
cube $C$. Their union is denoted $A \cup B$. It is the set of attributes together with their measures, on each
dimension, that appear in $A$, or $B$ or both. The union of $A$ and $B$ is $B$ if and only if $A$ is contained in $B$: $A$
is a subcube of $B$.

Proposition 1. If the aggregator $\sigma$ is monotonically increasing, then the union of any two cubes having $k$
carats over dimensions $i_1, \ldots, i_\delta$ has $k$ carats over dimensions $i_1, \ldots, i_\delta$ as well.

Proof. Any slice $x$ of the union of $A$ and $B$ contains a slice $x'$ from at least $A$ or $B$. Since $x'$ is contained in
$x$, and $\sigma(x') \ge k$, we have $\sigma(x) \ge k$. □

Hence, as long as $\sigma$ is monotonically increasing, there is a maximal cube having $k$ carats over
dimensions $i_1, \ldots, i_\delta$, and we call such a cube the diamond. When $\sigma$ is not monotonically increasing, there may
not be a unique diamond. Indeed, consider the even-numbered rows and columns of the following matrix,
then consider the odd-numbered rows and columns. Both are maximal cubes with 2 carats (of order 1) under
the SUM operator:

     1  -1   1  -1
    -1   1  -1   1
     1  -1   1  -1
    -1   1  -1   1
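
The claim about this matrix is small enough to check exhaustively. The following sketch (the helper names nonempty_subsets and has_carats are hypothetical, and indices are 0-based, so rows 0 and 2 are the odd-numbered rows of the text) enumerates every subcube and keeps the maximal ones having 2 carats under SUM.

```python
from itertools import combinations

# The 4 x 4 matrix above: entry (i, j) is +1 when i + j is even, -1 otherwise.
M = [[(-1) ** (i + j) for j in range(4)] for i in range(4)]


def nonempty_subsets(n):
    """All non-empty subsets of {0, ..., n-1}, as tuples."""
    for size in range(1, n + 1):
        yield from combinations(range(n), size)


def has_carats(rows, cols, k):
    """True if every row slice and every column slice sums to at least k."""
    return (all(sum(M[i][j] for j in cols) >= k for i in rows)
            and all(sum(M[i][j] for i in rows) >= k for j in cols))


two_carat = [(set(r), set(c))
             for r in nonempty_subsets(4)
             for c in nonempty_subsets(4)
             if has_carats(r, c, 2)]

# Keep the cubes that are not strictly contained in another 2-carat cube.
maximal = [(r, c) for (r, c) in two_carat
           if not any((r, c) != (r2, c2) and r <= r2 and c <= c2
                      for (r2, c2) in two_carat)]
print(maximal)
```

Two distinct maximal cubes survive and their union (the whole matrix) has slice sums of 0, which is why uniqueness fails without monotonicity.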

Because we wish diamonds to be unique, we will require $\sigma$ to be monotonically increasing.
The next proposition shows that diamonds are themselves nested.

Proposition 2. The diamond having $k'$ carats over dimensions $i_1, \ldots, i_\delta$ is contained in the diamond having
$k$ carats over dimensions $i_1, \ldots, i_\delta$ whenever $k' \ge k$.

Proof. Let $A$ be the diamond having $k$ carats and $B$ be the diamond having $k'$ carats. Since $k' \ge k$, the cube $B$
also has $k$ carats. By Proposition 1, $A \cup B$ has at least $k$ carats, and because $A$ is maximal, $A \cup B = A$; thus,
$B$ is contained in $A$. □

For simplicity, we only consider carats of order 1 for the rest of the paper. We write that a cube has
$k_1, k_2, \ldots, k_d$-carats if it has $k_i$ carats over dimension $D_i$; when $k_1 = k_2 = \ldots = k_d = k$ we simply write
that it has $k$ carats.

One consequence of Proposition 2 is that the diamonds having various numbers of carats form a lattice
(see Fig. 1) under the relation "is included in." This lattice creates optimization opportunities: if we are given
the 2, 1-carat diamond $X$ and the 1, 2-carat diamond $Y$, then we know that the 2, 2-carat diamond must lie
in both $X$ and $Y$. Likewise, if we have the 2, 2-carat diamond, then we know that its attribute values must
be included in the diamond above it in the lattice (such as the 2, 1-carat diamond).

Given the size of a sum-based diamond cube (in cells), there is no upper bound on its number of carats. 
However, it cannot have more carats than the sum of its measures. Conversely, if a cube has dimension sizes
$n_1, n_2, \ldots, n_d$ and $k$ carats, then its sum is at least $k \max(n_1, n_2, \ldots, n_d)$.

Given the dimensions of a COUNT-based diamond cube, $n_1 \le n_2 \le \ldots \le n_{d-1} \le n_d$, an upper bound
for the number of carats $k$ of a subcube is $\prod_{i=1}^{d-1} n_i$. An upper bound on the number of carats $k_i$ for dimension
$i$ is $\prod_{j=1, j \ne i}^{d} n_j$. An alternate (and trivial) upper bound on the number of carats in any dimension is $|C|$, the
number of allocated cells in the cube. For sparse cubes, this bound may be more useful.

Figure 1: Part of the COUNT-based diamond-cube lattice of a 2 x 2 x 2 cube

Intuitively, a cube with many carats needs to have a large number of allocated cells: accordingly, the next 
proposition provides a lower bound on the size of the cube given the number of carats. 

Proposition 3. For $d > 1$, the size $S$, or number of allocated cells, of a $d$-dimensional cube of $k$ carats
satisfies $S \ge k \max_{i \in \{1,2,\ldots,d\}} n_i \ge k^{d/(d-1)}$; more generally, a $k_1, k_2, \ldots, k_d$-carat cube has size $S$ satisfying
$S \ge \max_{i \in \{1,2,\ldots,d\}} k_i n_i \ge (\prod_{i=1}^{d} k_i)^{1/(d-1)}$.

Proof. Pick dimension $D_i$: the subcube has $n_i$ slices along this dimension, each with at least $k$ allocated cells,
proving the first item.

We have that $k (\sum_i n_i)/d \le k \max_{i \in \{1,2,\ldots,d\}} n_i$ so that the size of the subcube is at least $k (\sum_i n_i)/d$.
If we prove that $\sum_i n_i \ge d k^{1/(d-1)}$ then we will have that $k (\sum_i n_i)/d \ge k^{d/(d-1)}$, proving the
second item. This result can be shown using Lagrange multipliers. Consider the problem of minimizing
$\sum_i n_i$ given the constraints $\prod_{i=1,2,\ldots,j-1,j+1,\ldots,d} n_i \ge k$ for $j = 1, 2, \ldots, d$. These constraints are
necessary since all slices must contain at least $k$ cells. The corresponding Lagrangian is
$L = \sum_i n_i + \sum_j \lambda_j (\prod_{i=1,\ldots,j-1,j+1,\ldots,d} n_i - k)$. By inspection, the derivatives of $L$ with respect to $n_1, n_2, \ldots, n_d$
are zero and all constraints are satisfied when $n_1 = n_2 = \ldots = n_d = k^{1/(d-1)}$. For these values,
$\sum_i n_i = d k^{1/(d-1)}$ and this must be a minimum, proving the result. The more general result follows
similarly, by proving that the minimum of $\sum_i n_i k_i$ is reached when $n_i = (\prod_{j=1,\ldots,d} k_j)^{1/(d-1)} / k_i$ for all
$i$'s. □
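
As a sanity check on Proposition 3 (a worked instance added for illustration, not part of the proof), consider a fully allocated COUNT-based cube with the same number $n$ of attribute values in each of its $d$ dimensions. Both inequalities then hold with equality:

```latex
% Fully allocated cube with n_1 = ... = n_d = n and sigma = COUNT.
% Each order-1 slice holds n^{d-1} cells, so the cube has k = n^{d-1} carats.
\[
  S = n^{d} = k \cdot n = k \max_{i} n_i ,
  \qquad
  k^{d/(d-1)} = \left( n^{d-1} \right)^{d/(d-1)} = n^{d} = S .
\]
% Example: d = 3 and n = 4 give k = 16 and S = 64 = 16 \cdot 4 = 16^{3/2}.
```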

We calculate the volume of a cube $C$ as $\prod_{i=1}^{d} n_i$ and its density is the ratio of allocated cells, $|C|$, to the
volume ($|C| / \prod_{i=1}^{d} n_i$). Given $\sigma$, its carat-number, $\kappa(C)$, is the largest number of carats for which the cube
has a non-empty diamond. Intuitively, a small cube with many allocated cells should have a large $\kappa(C)$.

Is the carat-number $\kappa(C)$ a robust statistic? That is, with high probability, can changing a small
fraction of the data set change the statistic much? Of course, typical analyses are based on thresholds (e.g.
applied to support and accuracy in rule mining), and thus small changes to the cube may not always behave as
desired. Diamond dicing is no exception. For the cube $C$ in Fig. 3 and the statistic $\kappa(C)$ we see that diamond
dicing is not robust against an adversary who can deallocate a single cell: deallocation of the second cell on
the top row means that the cube no longer contains a diamond with 2 carats. This example can be
generalized.

Proposition 4. For any $b$, there is a cube $C$ from which deallocation of any $b$ cells results in a cube $C'$ with
$\kappa(C') = \kappa(C) - \Omega(b)$.

Proof. Let $C$ be a $d$-dimensional cube with $n_i = 2$ with all cells allocated. We see that $C$ has $2^{d-1}$ carats
and $\kappa(C) = 2^{d-1}$ (assume $d > 1$). Given $b$, set $x = \lfloor (d-1)b/(2d) \rfloor$. Because $x \ge (d-1)b/(2d) - 1 \ge b/4 - 1 \in \Omega(b)$,
it suffices to show that by deallocating $b$ cells, we can reduce the number of carats by $x$. By Proposition 3
we have that any cube with $2^{d-1} - x$ carats must have size at least $(2^{d-1} - x)^{d/(d-1)}$. When $x \ll 2^{d-1}$, this
size is approximately $2^{d} - 2dx/(d-1)$, and slightly larger by the alternation of the Taylor expansion. Hence, if
we deallocate at least $2dx/(d-1)$ cells, the number of carats must go down by at least $x$. But
$x = \lfloor (d-1)b/(2d) \rfloor \Rightarrow x \le (d-1)b/(2d) \Rightarrow 2dx/(d-1) \le b$, which shows the result. It is always possible to choose $d$ large enough so that $x \ll 2^{d-1}$
irrespective of the value $b$. □

Conversely, in Fig. 3 we might allocate the cell above the bottom-right corner, thereby obtaining a 2-carat
diamond with all $2n + 1$ cells. Compared to the original case with a 4-cell 2-carat diamond, we see that a
small change effects a very different result. Diamond dicing is not, in general, robust. However, it is perhaps
more reasonable to follow Pensa and Boulicaut [28] and ask whether $\kappa$ appears, experimentally, to be robust
against random noise on realistic data sets. We return to this in Subsection 6.5.

Many OLAP aggregators are distributive, algebraic and linear. An aggregator $\sigma$ is distributive [16] if
there is a function $F$ such that for all $0 \le k < n - 1$,

$\sigma(a_0, \ldots, a_k, a_{k+1}, \ldots, a_{n-1}) = F(\sigma(a_0, \ldots, a_k), \sigma(a_{k+1}, \ldots, a_{n-1}))$.

An aggregator $\sigma$ is algebraic if there is an intermediate tuple-valued distributive range-query function $G$
from which $\sigma$ can be computed. An algebraic example is AVERAGE: given the tuple (COUNT, SUM), one can
compute AVERAGE by a ratio. In other words, if $\sigma$ is an algebraic function then there must exist $G$ and $F$
such that

$G(a_0, \ldots, a_k, a_{k+1}, \ldots, a_{n-1}) = F(G(a_0, \ldots, a_k), G(a_{k+1}, \ldots, a_{n-1}))$.

An algebraic aggregator $\sigma$ is linear [19] if the corresponding intermediate query $G$ satisfies

$G(a_0 + \alpha d_0, \ldots, a_{n-1} + \alpha d_{n-1}) = G(a_0, \ldots, a_{n-1}) + \alpha G(d_0, \ldots, d_{n-1})$

for all arrays $a$, $d$, and constants $\alpha$. SUM and COUNT are linear functions; MAX is not linear.
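
These definitions can be made concrete with AVERAGE. In the sketch below, the function names g, f and average are hypothetical stand-ins for $G$, $F$ and the algebraic aggregator. The last lines check the linearity identity for SUM and give a small counterexample for MAX.

```python
def g(values):
    """Intermediate tuple (COUNT, SUM) of a range of values."""
    return (len(values), sum(values))


def f(left, right):
    """Combine the intermediate tuples of two adjacent ranges."""
    return (left[0] + right[0], left[1] + right[1])


def average(values, split):
    """AVERAGE computed from the tuples of values[:split] and values[split:]."""
    count, total = f(g(values[:split]), g(values[split:]))
    return total / count


data = [3, 1, 4, 1, 5]
# Any split point gives the same answer as a direct computation.
same_everywhere = all(average(data, k) == sum(data) / len(data)
                      for k in range(1, len(data)))

# Linearity: SUM satisfies G(a + alpha*d) = G(a) + alpha*G(d); MAX does not.
a, d, alpha = [1, 0], [0, 1], 1
shifted = [x + alpha * y for x, y in zip(a, d)]
sum_is_linear_here = sum(shifted) == sum(a) + alpha * sum(d)
max_is_linear_here = max(shifted) == max(a) + alpha * max(d)
print(same_everywhere, sum_is_linear_here, max_is_linear_here)
```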
4 Related Problems



In this section, we discuss four problems, three of which are NP-hard, and show that the diamond — while 
perhaps not providing an exact solution — is a good starting point. The first two problems, Trawling the Web 
for Cyber-communities and Largest Perfect Subcube, assume use of the aggregator COUNT whilst for the 
remaining problems we assume SUM. 

4.1 Trawling the Web for Cyber-communities 

In 1999, Kumar et al. [18] introduced the ITERATIVE PRUNING algorithm for discovering emerging com- 
munities on the Web. They model the Web as a directed graph and seek large dense bipartite subgraphs or 
cores, and therefore their problem is a 2-D version of our problem. Although their paper has been widely 
cited [30, 34] , to our knowledge, we are the first to propose a multidimensional extension to their problem 
suitable for use in more than two dimensions and to provide a formal analysis. 

4.2 Largest Perfect Cube 

A perfect cube contains no empty cells, and thus it is a diamond. Finding the largest perfect diamond is 
NP-hard. A motivation for this problem is found in Formal Concept Analysis [15], for example. 

Proposition 5. Finding a perfect subcube with largest volume is NP-hard, even in 2-D. 

Proof. A 2-D cube is essentially an unweighted bipartite graph. Thus, a perfect subcube corresponds directly 
to a biclique — a clique in a bipartite graph. Finding a biclique with the largest number of edges has been 
shown NP-hard by Peeters [26], and this problem is equivalent to finding a perfect subcube of maximum 
volume. □ 

Finding a diamond might be part of a sensible heuristic to solve this problem, as the next lemma suggests. 

Lemma 1. For COUNT-based carats, a perfect subcube of size $n_1 \times n_2 \times \ldots \times n_d$ is contained in the
$\prod_{i=1}^{d} n_i / \max_i n_i$-carat diamond and in the $k_1, k_2, \ldots, k_d$-carat diamond where $k_i = \prod_{j=1}^{d} n_j / n_i$.

This helps in two ways: if there is a nontrivial diamond of the specified size, we can search for the 
perfect subcube within it; however, if there is only an empty diamond of the specified size, there is no perfect 
subcube. 

4.3 Densest Cube with Limited Dimensions 

In the OLAP context, given a cube, a user may ask to "find the subcube with at most 100 attribute values 
per dimension." Meanwhile, he may want to keep as much of the cube as possible. We call this problem 
Densest Cube with Limited Dimensions (DCLD), which we formalize as: pick $\min(n_i, p)$ attribute
values for dimension $D_i$, for all $i$'s, so that the resulting subcube is maximally dense.

Intuitively, a densest cube should at least contain a diamond. We proceed to show that a sufficiently
dense cube always contains a diamond with a large number of carats.

Proposition 6. If a cube does not contain a $k$-carat subcube, then it has at most $1 + (k - 1) \sum_{i=1}^{d} (n_i - 1)$
allocated cells. Hence, it has density at most $(1 + (k - 1) \sum_{i=1}^{d} (n_i - 1)) / \prod_{i=1}^{d} n_i$. More generally, a cube
that does not contain a $k_1, k_2, \ldots, k_d$-carat subcube has size at most $1 + \sum_{i=1}^{d} (k_i - 1)(n_i - 1)$ and density
at most $(1 + \sum_{i=1}^{d} (k_i - 1)(n_i - 1)) / \prod_{i=1}^{d} n_i$.

Proof. Suppose that a cube of dimension at most $n_1 \times n_2 \times \ldots \times n_d$ contains no $k$-carat diamond. Then one
slice must contain at most $k - 1$ allocated cells. Remove this slice. The amputated cube must not contain a
$k$-carat diamond. Hence, it has one slice containing at most $k - 1$ allocated cells. Remove it. This iterative
process can continue at most $\sum_i (n_i - 1)$ times before there is at most one allocated cell left: hence, there
are at most $(k - 1) \sum_i (n_i - 1) + 1$ allocated cells in total. The more general result follows similarly. □

The following corollary follows trivially from Proposition 6.

Corollary 1. A cube of size greater than $1 + (k - 1) \sum_{i=1}^{d} (n_i - 1)$ allocated cells, that is, having density
greater than

$\frac{1 + (k - 1) \sum_{i=1}^{d} (n_i - 1)}{\prod_{i=1}^{d} n_i}$,

must contain a $k$-carat subcube. If a cube contains more than $1 + \sum_{i=1}^{d} (k_i - 1)(n_i - 1)$ allocated cells, it
must contain a $k_1, k_2, \ldots, k_d$-carat subcube.

Solving for $k$, we have a lower bound on the maximal number of carats: $\kappa(C) \ge |C| / \sum_i (n_i - 1) - 3$.
We also have the following corollary to Proposition 6.

Corollary 2. Any solution of the DCLD problem having density above

$\frac{1 + (k - 1) \sum_{i=1}^{d} (\min(n_i, p) - 1)}{\prod_{i=1}^{d} \min(n_i, p)} \le \frac{1 + (k - 1) d (p - 1)}{\prod_{i=1}^{d} \min(n_i, p)}$

must intersect with the $k$-carat diamond.

When $n_i \ge p$ for all $i$, then the density threshold of the previous corollary is $(1 + (k - 1) d (p - 1)) / p^d$:
this value goes to zero exponentially as the number of dimensions increases.

We might hope that when the dimensions of the diamond coincide with the required dimensions of the
densest cube, we would have a solution to the DCLD problem. Alas, this is not true. Consider the 2-D cube
in Fig. 2. The bottom-right quadrant forms the largest 3-carat subcube. In the bottom-right quadrant, there
are 15 allocated cells whereas in the upper-left quadrant there are 16 allocated cells. This proves the next
result.

Lemma 2. Even if a diamond has exactly $\min(n_i, p_i)$ attribute values for dimension $D_i$, for all $i$'s, it may
still not be a solution to the DCLD problem.

We are interested in large data sets; the next theorem shows that solving DCLD and HCLD is difficult. 

Theorem 1. The DCLD and HCLD problems are NP-hard.

Figure 2: Example showing that a diamond (bottom-right quadrant) may not have optimal density.

Proof. The EXACT BALANCED PRIME NODE CARDINALITY DECISION PROBLEM (EBPNCD) is NP-complete [10]:
for a given bipartite graph $G = (V_1, V_2, E)$ and a number $p$, does there exist a biclique $U_1$ and $U_2$ in $G$ such
that $|U_1| = p$ and $|U_2| = p$?

Given an EBPNCD instance, construct a 2-D cube where each value of the first dimension corresponds to
a vertex of $V_1$, and each value of the second dimension corresponds to a vertex of $V_2$. Fill the cell corresponding
to $(v_1, v_2) \in V_1 \times V_2$ with a measure value if and only if $v_1$ is connected to $v_2$. The solution of the DCLD
problem applied to this cube with a limit of $p$ will be a biclique if such a biclique exists. □

It follows that HCLD is also NP-hard by reduction of DCLD.

4.4 Heaviest Cube with Limited Dimensions 

In the OLAP context, given a cube, a user may ask to "find a subcube with 10 attribute values per dimension." 
Meanwhile, he may want the resulting subcube to have maximal average — he is, perhaps, looking for the 
10 attributes from each dimension that, in combination, give the greatest profit. Note that this problem does 
not restrict the number of attribute values (p) to be the same for each dimension. 

We call this problem the Heaviest Cube with Limited Dimensions (HCLD), which we formalize
as: pick $\min(n_i, p_i)$ attribute values for dimension $D_i$, for all $i$'s, so that the resulting subcube has maximal
average. We have that the HCLD must intersect with diamonds.

Theorem 2. Using the SUM operator, a cube without any $k_1, k_2, \ldots, k_d$-carat subcube has sum less than
$\sum_{i=1}^{d} (n_i + 1) k_i + \max(k_1, k_2, \ldots, k_d)$ where the cube has size $n_1 \times n_2 \times \ldots \times n_d$.

Proof. Suppose that a cube of dimension $n_1 \times n_2 \times \ldots \times n_d$ contains no $k_1, k_2, \ldots, k_d$-sum-carat cube.
Such a cube must contain at least one slice, along some dimension $D_i$, with sum less than $k_i$; remove it. The
remainder must also not contain a $k_1, k_2, \ldots, k_d$-sum-carat cube; remove another slice and so on. This process
may go on at most $\sum_{i=1}^{d} (n_i + 1)$ times before there is only one cell left. Hence, the sum of the cube is less than
$\sum_{i=1}^{d} (n_i + 1) k_i + \max(k_1, k_2, \ldots, k_d)$. □

Corollary 3. Any solution to the HCLD problem having average greater than

$\frac{\sum_{i=1}^{d} (n_i + 1) k_i + \max(k_1, k_2, \ldots, k_d)}{\prod_{i=1}^{d} n_i}$

must intersect with the $k_1, k_2, \ldots, k_d$-sum-carat diamond.

5 Algorithm 

We have developed and implemented an algorithm for computing diamonds. Its overall approach is
illustrated by Example 1. That approach is to repeatedly identify an attribute value that cannot be in the diamond,
and then (possibly not immediately) remove the attribute value and its slice. The identification of "bad"
attribute values is done conservatively, in that they are known already to have a sum less than required ($\sigma$
is SUM), or insufficient allocated cells ($\sigma$ is COUNT). When the algorithm terminates, we are left with only
attribute values that meet the condition in every slice: a diamond.

Example 1. Suppose we seek a 4,10-carat diamond in Table 1 using Algorithm 1. On a first pass, we can
delete the attribute values "Chicago" and "TV" because their respective slices have sums below 10 and 4.
On a second pass, values "Berlin," "Game console" and "DVD Player" can be removed because the sums of their
slices were reduced by the removal of the values "Chicago" and "TV." The algorithm then terminates.

Algorithms based on this approach will always terminate, though they might sometimes return an empty 
cube. The correctness of our algorithm is guaranteed by the following result. 

Theorem 3. Algorithm 1 is correct, that is, it always returns the $k_1, k_2, \ldots, k_d$-carat diamond.

Proof. Because the diamond is unique, we need only show that the result of the algorithm, the cube $A$, is a
diamond. If the result is not the empty cube, then dimension $D_i$ has at least value $k_i$ per slice, and hence it
has $k_i$ carats. We only need to show that the result of Algorithm 1 is maximal: there does not exist a larger
$k_1, k_2, \ldots, k_d$-carat cube.

Suppose $A'$ is such a larger $k_1, k_2, \ldots, k_d$-carat cube. Because Algorithm 1 begins with the whole cube
$C$, there must be a time when, for the first time, one of the attribute values of $C$ belonging to $A'$ but not $A$
is deleted. This attribute is not written to the output file because its corresponding slice of dimension dim
had value less than $k_{\mathrm{dim}}$. At the time of deletion, this attribute's slice cannot have obtained more cells after
it had been deleted, so it still has value less than $k_{\mathrm{dim}}$. Let $C'$ be the cube at the instant before the attribute
is deleted, from which all attribute values deleted so far have been removed. We see that $C'$ is larger than or
equal to $A'$ and therefore, slices in $C'$ corresponding to attribute values of $A'$ along dimension dim must have
at least $k_{\mathrm{dim}}$ carats. Therefore, we have a contradiction and must conclude that $A'$ does not exist and that $A$ is maximal. □

For simplicity of exposition, in the rest of the paper, we assume that the number of carats is the
same for all dimensions.

Our algorithm employs a preprocessing step that iterates over the input file creating $d$ hash tables that
map attributes to their $\sigma$-values. When $\sigma$ = COUNT, the $\sigma$-values for each dimension form a histogram,
which might be precomputed in a DBMS.

These values can be updated quickly as long as $\sigma$ is linear: aggregators like SUM and COUNT are good
candidates. If the cardinality of any of the dimensions is such that hash tables cannot be stored in main
memory, then a file-based set of hash tables could be constructed. However, given a $d$-dimensional cube,
there are only $\sum_{i=1}^{d} n_i$ slices and so the memory usage is $O(\sum_{i=1}^{d} n_i)$: for our tests, main memory hash
tables suffice.

input: file inFile containing $d$-dimensional cube $C$, integer $k > 0$
output: the diamond data cube

// preprocessing scan computes $\sigma$ values for each slice
foreach dimension $i$ do
    Create hash table $ht_i$
    foreach attribute value $v$ in dimension $i$ do
        if $\sigma$(slice for value $v$ of dimension $i$ in $C$) $\ge k$ then
            $ht_i(v)$ = $\sigma$(slice for value $v$ of dimension $i$ in $C$)
        end
    end
end
stable $\leftarrow$ false
while $\neg$stable do
    Create new output file outFile // iterate main loop
    stable $\leftarrow$ true
    foreach row $r$ of inFile do
        $(v_1, v_2, \ldots, v_d) \leftarrow r$
        if $v_i \in \mathrm{dom}\ ht_i$, for all $1 \le i \le d$ then
            write $r$ to outFile
        else
            for $j \in \{1, \ldots, i-1, i+1, \ldots, d\}$ do
                if $v_j \in \mathrm{dom}\ ht_j$ then
                    $ht_j(v_j) = ht_j(v_j) - \sigma(\{r\})$
                    if $ht_j(v_j) < k$ then
                        remove $v_j$ from $\mathrm{dom}\ ht_j$
                    end
                end
            end
            stable $\leftarrow$ false
        end
    end
    if $\neg$stable then
        inFile $\leftarrow$ outFile // prepare for another iteration
    end
end
return outFile

Algorithm 1: Diamond dicing for relationally stored cubes. Each iteration, less data is processed.

Figure 3: An n x n cube with 2n allocated cells (each indicated by a 1) and a 2-carat diamond in the upper
left: it is a difficult case for an iterative algorithm.
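
Algorithm 1 can be sketched in Python as follows. This is a simplified illustration and not the implementation used for the experiments: the function name diamond_dice and the measure parameter are hypothetical, an in-memory list of rows stands in for inFile and outFile, and a single threshold k is used for all dimensions.

```python
from collections import defaultdict


def diamond_dice(rows, d, k, measure=lambda row: 1):
    """rows: tuples (v_1, ..., v_d, m). The default measure counts cells
    (COUNT); pass measure=lambda row: row[-1] to aggregate with SUM."""
    # Preprocessing scan: sigma of every slice, one hash table per dimension.
    totals = [defaultdict(int) for _ in range(d)]
    for row in rows:
        for i in range(d):
            totals[i][row[i]] += measure(row)
    ht = [{v: s for v, s in totals[i].items() if s >= k} for i in range(d)]

    stable = False
    while not stable:
        stable = True
        kept = []                      # plays the role of outFile
        for row in rows:
            if all(row[i] in ht[i] for i in range(d)):
                kept.append(row)
            else:
                # The row is dropped: subtract it from its surviving slices.
                for j in range(d):
                    if row[j] in ht[j]:
                        ht[j][row[j]] -= measure(row)
                        if ht[j][row[j]] < k:
                            del ht[j][row[j]]
                stable = False
        rows = kept                    # prepare for another iteration
    return rows


# A 2 x 2 block plus a chain of cells that is peeled off one pass at a time.
cube = [("a", "x", 1), ("a", "y", 1), ("b", "x", 1), ("b", "y", 1),
        ("b", "z", 1), ("c", "z", 1), ("c", "w", 1)]
result = diamond_dice(cube, d=2, k=2)
print(result)
```

On this small cube each pass removes only one more link of the chain, which mirrors the difficult case of Fig. 3 in miniature.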

Algorithm 1 reads and writes the files sequentially from and to disk and does not require potentially
expensive random access, making it a candidate for a data parallel implementation in the future.

Let $I$ be the number of iterations through the input file till convergence, i.e., until no more deletions are done.
Value $I$ is data dependent and (by Fig. 5) is $\Theta(\sum_i n_i)$ in the worst case. In practice, we do not expect $I$ to
be nearly so large, and working with our largest "real world" data sets we never found $I$ to exceed 100.

Algorithm 1 runs in time $O(Id|C|)$; each attribute value is deleted at most once. In many cases, the input
file decreases substantially in the first few iterations and those cubes will be processed faster than this bound
suggests. The more carats we seek, the faster the file will decrease initially.

The speed of convergence of Algorithm 1 and indeed the size of an eventual diamond may depend on
the data-distribution skew. Cell allocation in data cubes is very skewed and frequently follows
Zipfian/Pareto/zeta distributions [24]. Suppose the number of allocated cells $C_{\mathrm{dim},i}$ in a given slice $i$ follows a zeta
distribution: $P(C_{\mathrm{dim},i} = j) \propto j^{-s}$ for $s > 1$. The parameter $s$ is indicative of the skew. We then have
that $P(C_{\mathrm{dim},i} < k_i) = \sum_{j=1}^{k_i - 1} j^{-s} / \sum_{j=1}^{\infty} j^{-s} = P_{k_i,s}$. The expected number of slices marked for deletion
after one pass over all dimensions using $\sigma$ = COUNT, prior to any slice deletion, is thus $\sum_{i=1}^{d} n_i P_{k_i,s}$.
This quantity grows fast to $\sum_{i=1}^{d} n_i$ (all slices marked for deletion) as $s$ grows (see Fig. 4). For SUM-based
diamonds, we not only have the skew of the cell allocation, but also the skew of the measures to accelerate
convergence. In other words, we expect Algorithm 1 to converge quickly over real data sets, but more slowly
over synthetic cubes generated using uniform distributions.
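
The fraction of slices expected to be marked for deletion can be approximated numerically. The sketch below is an illustration under stated assumptions: the function name fraction_marked is hypothetical, and the infinite sum is truncated after a fixed number of terms, which is adequate for $s$ well above 1 but increasingly inaccurate as $s$ approaches 1.

```python
import math


def fraction_marked(k, s, terms=100_000):
    """Approximate the probability that a slice holds fewer than k cells
    when slice sizes follow a zeta distribution with skew parameter s."""
    zeta = sum(j ** -s for j in range(1, terms + 1))   # truncated zeta(s)
    return sum(j ** -s for j in range(1, k)) / zeta


# For k = 2 and s = 2 the exact value is 1 / zeta(2) = 6 / pi^2.
print(fraction_marked(2, 2.0), 6 / math.pi ** 2)

# A larger skew marks a larger share of slices for the same threshold.
print(fraction_marked(5, 1.5), fraction_marked(5, 3.0))
```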

5.1 Finding the Largest Number of Carats 

The determination of $\kappa(C)$, the largest value of $k$ for which $C$ has a non-trivial diamond, is a special case of
the computation of the diamond-cube lattice (see Proposition 2). Identifying $\kappa(C)$ may help guide analysis.
Two approaches have been identified:

1. Assume $\sigma$ = COUNT. Set the parameter $k$ to 1 + the lower bound (provided by Proposition 6 or
Theorem 2) and check whether there is a diamond with $k$ carats. Repeat, incrementing $k$, until an
empty cube results. At each step, Proposition 2 says we can start from the cube from the previous
iteration, rather than from $C$. When $\sigma$ is SUM, there are two additional complications. First, the value
of $k$ can grow large if measure values are large. Furthermore, if some measures are not integers, the
result need not be an integer (hence we would compute $\lfloor \kappa(C) \rfloor$ by applying this method, and not
$\kappa(C)$).

2. Assume $\sigma$ = COUNT. Observe that $\kappa(C)$ is in a finite interval. We have a lower bound from
Proposition 6 or Theorem 2 and an upper bound $\prod_{i=1}^{d-1} n_i$ or $|C|$. (If this upper bound is unreasonably large,
we can either use the number of cells in our current cube, or we could start with the lower bound and
repeatedly double it.) Execute the diamond-dicing algorithm and set $k$ to a value determined by a
binary search over its valid range. Every time the lower bound changes, we can make a copy of the
resulting diamond. Thus, each time we test a new midpoint $k$, we can begin the computation from the
copy (by Proposition 2). If $\sigma$ is SUM and measures are not integer values, it might be difficult to know
when the binary search has converged exactly.

Figure 4: Expected fraction of slices marked for deletion after one pass under a zeta distribution for various
values of the skew parameter s.

We believe the second approach is better. Let us compare one iteration of the first approach (which
begins with a k-carat diamond and seeks a (k + 1)-carat diamond) and a comparable iteration of the second
approach (which begins with a k-carat diamond and seeks a ((k + k_upper)/2)-carat diamond). Both will end up
making at least one scan, and probably several more, through the k-carat diamond. Now, we experimentally
observe that k values that slightly exceed κ(C) tend to lead to several times more scans through the cube than
other values of k. Our first approach will make only one such unsuccessful attempt, whereas the binary
search would typically make several unsuccessful attempts while narrowing in on κ(C). Nevertheless, we
believe that making far fewer attempts overall will outweigh this effect. We recommend binary search, given
that it will find κ(C) in O(log κ(C)) iterations.
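
The binary-search approach can be sketched as follows for σ = COUNT. This is only an illustration under our own assumptions: the cube is a small in-memory set of d-tuples, `diamond` is a naive stand-in for the diamond-dicing algorithm (it repeatedly deletes slices holding fewer than k cells), and the names `diamond` and `largest_carat` are hypothetical. Each probe starts from the last non-empty diamond, which is valid because a diamond with more carats is contained in one with fewer.

```python
from collections import Counter


def diamond(cells, k):
    """Return the cells of the k-carat COUNT diamond of `cells` (a set of d-tuples).

    Naive in-memory version: delete every slice (attribute value of some
    dimension) that holds fewer than k cells, and repeat until stable.
    """
    cells = set(cells)
    changed = True
    while changed and cells:
        changed = False
        d = len(next(iter(cells)))
        for dim in range(d):
            counts = Counter(c[dim] for c in cells)
            weak = {v for v, n in counts.items() if n < k}
            if weak:
                cells = {c for c in cells if c[dim] not in weak}
                changed = True
    return cells


def largest_carat(cells, lower=1, upper=None):
    """Binary search for the largest k whose COUNT diamond is non-empty.

    `lower` is assumed to be a valid lower bound; if its diamond is empty,
    (0, empty set) is returned. `upper` defaults to the number of cells.
    """
    best = diamond(cells, lower)
    if not best:
        return 0, set()
    if upper is None:
        upper = len(best)
    while lower < upper:
        mid = (lower + upper + 1) // 2
        candidate = diamond(best, mid)  # restart from the saved copy, not from C
        if candidate:
            lower, best = mid, candidate
        else:
            upper = mid - 1
    return lower, best


if __name__ == "__main__":
    # A full 3 x 3 block plus two stray cells.
    cube = {(r, c) for r in range(3) for c in range(3)} | {(3, 0), (0, 3)}
    k, cells = largest_carat(cube)
    print(k, sorted(cells))
```

On this toy cube the search probes k = 6, 3 and 4 and settles on the 3 × 3 block; a sequential search would have probed k = 2, 3 and 4.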

If one is willing to accept an approximate answer for κ(C) when aggregating with SUM, a similar approach
can be used.



5.2 Diamond-Based Heuristic for DCLD 



In Section 4.4 we noted that a diamond with the appropriate shape will not necessarily solve the DCLD 
problem. Nevertheless, when we examined many small random cubes, the solutions typically coincided. 
Therefore, we suggest diamond dicing as a heuristic for DCLD. 
A heuristic for DCLD can start with a diamond and then refine its shape. Our heuristic first finds a
diamond that is only somewhat too large, then removes slices until the desired shape is obtained. See
Algorithm 2.

input: d-dimensional cube C, integers p_1, p_2, ..., p_d
output: Cube with size p_1 × p_2 × ... × p_d

// Use binary search to find k
Find max k where the k-carat diamond Δ has shape p'_1 × p'_2 × ... × p'_d, where ∀i. p'_i ≥ p_i
for i ← 1 to d do
    Sort slices of dimension i of Δ by their σ values
    Retain only the top p_i slices and discard the remainder from Δ
end
return Δ

Algorithm 2: DCLD heuristic that starts from a diamond.
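
The trimming loop of this heuristic might look as follows for σ = COUNT. This is a minimal sketch, assuming the diamond has already been found and is given as a set of d-tuples; the function name `trim_to_shape` is ours. Note that slices are ranked by their count within the cells still retained, and that the result may end up with fewer than p_i values in a dimension if later trimming empties a slice.

```python
from collections import Counter


def trim_to_shape(diamond_cells, shape):
    """Trim a cube to at most shape[i] attribute values in each dimension i.

    Dimensions are processed in order. In each one, slices are ranked by their
    COUNT among the cells still retained, and only the top shape[i] slices
    survive. Ties at the cut-off are broken arbitrarily.
    """
    cells = set(diamond_cells)
    for dim, p in enumerate(shape):
        counts = Counter(c[dim] for c in cells)
        keep = {v for v, _ in counts.most_common(p)}
        cells = {c for c in cells if c[dim] in keep}
    return cells


if __name__ == "__main__":
    diamond_cells = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0)}
    print(sorted(trim_to_shape(diamond_cells, (2, 2))))
```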

6 Experiments 

We wish to show that diamonds can be computed efficiently. We also want to review experimentally some of 
the properties of diamonds including their density (count-based diamonds) and the range of values the carats 
may take in practice. Finally, we want to provide some evidence that diamond dicing can serve as the basis 
for a DCLD heuristic. 

6.1 Data Sets 

We experimented with diamond dicing on several different data sets, some of whose properties are laid out
in Tables 2 and 5.

Table 2: Real data sets used in experiments

                    TWEED                       Netflix                       Census income
cube                TW1      TW2      TW3       NF1            NF2            C1         C2
dimensions          4        15       15        3              3              28         28
|C|                 1,957    4,963    4,963     100,478,158    100,478,158    196,054    196,054
∑ n_i               88       674      674       500,137        500,137        533        533
measure             count    count    killed    count          rating         stocks     wage
iters to converge   6        10       3         19             40             6          4
κ                   38       37       85        1,004          3,483          99,999     9,999

Cubes TW1, TW2 and TW3 were extracted from TWEED [12], which contains over 11,000 records of
events related to internal terrorism in 18 countries in Western Europe between 1950 and 2004. Of the 52
dimensions in the TWEED data, 37 were measures since they decomposed the number of people killed/injured
into all the affected groups. Cardinalities of the dimensions ranged from 3 to 284. Cube TW1 retained
dimensions Country, Year, Action and Target with cardinalities of 16 × 53 × 11 × 11. For cubes TW2 and
TW3 all dimensions not deemed measures were retained. Cubes TW2 and TW3 were rolled up and stored
in a MySQL database using the following query, and the resulting tables were exported to comma-separated
files. A similar process was followed for TW1. Table 3 lists the details of the TWEED data.

INSERT INTO tweed15 (d1, d2, d3, d4, d5, d6, d7, d8,
                     d31, d32, d33, d34, d50, d51, d52, d49)
SELECT d1, d2, d3, d4, d5, d6, d7, d8,
       d31, d32, d33, d34, d50, d51, d52, SUM(d9)
FROM `tweed`
GROUP BY d1, d2, d3, d4, d5, d6, d7, d8,
         d31, d32, d33, d34, d50, d51, d52;

We also processed the Netflix data set [25], which has dimensions MovieID × UserID × Date × Rating
(17766 × 480189 × 2182 × 5). Each row in the fact table has a distinct pair of values (MovieID, UserID).
We extracted two 3-D cubes, NF1 and NF2, both with about 10^8 allocated cells, using dimensions MovieID,
UserID and Date. For NF2 we use Rating as the measure and the SUM aggregator, whereas NF1 uses the
COUNT aggregator. The Netflix data set is the largest openly available movie-rating database (≈ 2 GiB).

Our third real data set, Census-Income, comes from the UCI KDD Archive [17]. The cardinalities of the
dimensions ranged from 2 to 91 and there were 199,523 records. We rolled up the original 41 dimensions
to 27 and used two measures, income from stocks (C1) and hourly wage (C2). The MySQL query used to
generate cube C1 follows. Note that the dimension numbers map to those given in the census-income.names
file [17]. Details are provided in Table 4.

INSERT INTO `census-income_stocks` (`d0`, `d1`, `d2`, `d3`, `d4`, `d6`,
    `d7`, `d8`, `d9`, `d10`, `d12`, `d13`, `d15`, `d21`, `d23`,
    `d24`, `d25`, `d26`, `d27`, `d28`, `d29`, `d31`, `d32`, `d33`,
    `d34`, `d35`, `d38`, `d18`)
SELECT `d0`, `d1`, `d2`, `d3`, `d4`, `d6`, `d7`, `d8`, `d9`, `d10`,
    `d12`, `d13`, `d15`, `d21`, `d23`, `d24`, `d25`, `d26`,
    `d27`, `d28`, `d29`, `d31`, `d32`, `d33`, `d34`, `d35`, `d38`, SUM(`d18`)
FROM `census-income`
GROUP BY `d0`, `d1`, `d2`, `d3`, `d4`, `d6`, `d7`, `d8`, `d9`,
    `d10`, `d12`, `d13`, `d15`, `d21`, `d23`, `d24`, `d25`, `d26`, `d27`,
    `d28`, `d29`, `d31`, `d32`, `d33`, `d34`, `d35`, `d38`;

We also generated synthetic data. As has already been stated, cell allocation in data cubes is skewed.
We modelled this by generating values in each dimension that followed a power distribution. The values in
dimension i were generated as $\lfloor n_i u^{1/a} \rfloor$, where u ∈ [0, 1] is drawn from a uniform distribution. For a = 1,
this function generates uniformly distributed values. The dimensions are statistically independent. We picked
the first 250,000 distinct facts. Since cubes S2A and S3A were generated with close to 250,000 distinct facts, we
decided to keep them all.
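
The generation procedure can be sketched as below. This is an illustration under our own assumptions rather than the code used for the experiments: the function name `synthetic_cube`, the seed, and the attempt limit (a guard against heavy skew producing too few distinct facts) are hypothetical.

```python
import random


def synthetic_cube(cardinalities, a, n_facts, seed=0):
    """Generate up to n_facts distinct facts with skewed attribute values.

    Each coordinate in dimension i is floor(n_i * u**(1/a)) with u uniform on
    [0, 1). With a = 1 the values are uniform; smaller a concentrates the
    values near 0. Dimensions are drawn independently.
    """
    rng = random.Random(seed)
    facts = set()
    attempts = 0
    max_attempts = 100 * n_facts  # assumption: give up if skew yields too few facts
    while len(facts) < n_facts and attempts < max_attempts:
        fact = tuple(
            min(n - 1, int(n * rng.random() ** (1.0 / a)))  # clamp guards rounding
            for n in cardinalities
        )
        facts.add(fact)  # duplicates are ignored: only distinct facts count
        attempts += 1
    return facts


if __name__ == "__main__":
    cube = synthetic_cube((10, 100, 1000, 10000), a=0.2, n_facts=1000)
    print(len(cube), "distinct facts")
```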

The cardinalities for all synthetic cubes are laid out in Table 6. All experiments on our synthetic data
were done using the measure COUNT.
Table 3: Measures and dimensions of TWEED data. Shaded dimensions are those retained for TW1. All
dimensions were retained for cubes TW2 and TW3 (with total people killed as its measure).

        Dimension                                           Cardinality
d1      Day                                                 32
d2      Month                                               13
d3      Year                                                53
d4      Country                                             16
d5      Type of agent                                       3
d6      Acting group                                        287
d7      Regional context of the agent                       34
d8      Type of action
d31     State institution                                   6
d32     Kind of action                                      4
d33     Type of action by state                             7
d34     Group against which the state action is directed    182
d50     Group's attitude towards state                      6
d51     Group's ideological character                       9

        Measure
d49     total people killed
            people from the acting group, military, police, civil servants, politicians,
            business executives, trade union leaders, clergy, other militants, civilians
        total people injured
            acting group, military, police, civil servants, politicians,
            business, trade union leaders, clergy, other militants, civilians
        total people killed by state institution
            group members, other people
        total people injured by state institution
            group members, other people
        arrests
        convictions
        executions
        total killed by non-state group at which the state directed an action
            people from state institution, others
        total injured by non-state group
            people from state institution, others
Table 4: Census Income data: dimensions and cardinality of dimensions. Shaded dimensions and measures
retained for cubes C1 and C2. Dimension numbering maps to those described in the file
census-income.names [17].

          Dimension                                     Cardinality
d0        age                                           91
d1        class of worker                               9
d2        industry code                                 52
d3        occupation code                               47
d4        education                                     17
d6        enrolled in education last week               3
d7        marital status                                7
d8        major industry code                           24
d9        major occupation code                         15
d10       race                                          5
d12       sex                                           2
d13       member of a labour union                      3
d15       full or part time employment status           8
d21       state of previous residence                   51
d23       detailed household summary in household       8
d24       migration code - change in msa                10
d25       migration code - change in region             9
d26       migration code - moved within region          10
d27       live in this house 1 year ago                 3
d28       migration previous residence in sunbelt       4
d29       number of persons worked for employer         7
d31       country of birth father                       43
d32       country of birth mother                       43
d33       country of birth self                         43
d34       citizenship                                   5
d35       own business or self employed                 3
d38       weeks worked in year                          53
d11       hispanic origin                               10
d14       reason for unemployment                       6
d19       tax filer status                              6
d20       region of previous residence                  6
d22       detailed household and family status          38
ignored   instance weight
d30       family members under 18                       5
d36       fill inc questionnaire for veteran's admin    3
d37       veteran's benefits                            3
d39       year                                          2
ignored   classification bin

          Measure                                       Cube
d18       dividends from stocks                         C1
d5        wage per hour                                 C2
d16       capital gains
d17       capital losses
Table 5: Synthetic data sets used in experiments

cube                S1A       S1B       S1C       S2A       S2B       S2C       S3A       S3B       S3C
dimensions          4         4         4         8         8         8         16        16        16
skew factor         0.02      0.2       1.0       0.02      0.2       1.0       0.02      0.2       1.0
|C|                 250k      250k      250k      251k      250k      250k      262k      250k      250k
∑ n_i               11,106    11,098    11,110    22,003    22,195    22,220    38,354    44,379    44,440
iters to converge   12        9         2         6         12        12        8         21        6
κ                   135       121       30        133       32        18        119       8
Table 6: Dimensional cardinalities for our synthetic data cubes

Cube   Dimensional cardinalities
S1A    6 × 100 × 1000 × 10000
S1B    2 × 100 × 1000 × 9996
S1C    10 × 100 × 1000 × 10000
S2A    10 × 100 × 1000 × 9881 × 10 × 100 × 1000 × 9902
S2B    10 × 100 × 1000 × 9987 × 10 × 100 × 1000 × 9988
S2C    10 × 100 × 1000 × 10000 × 10 × 100 × 1000 × 10000
S3A    10 × 100 × 1000 × 8465 × 10 × 100 × 1000 × 8480
       × 10 × 100 × 1000 × 8502 × 10 × 100 × 1000 × 8467
S3B    10 × 100 × 1000 × 9982 × 10 × 100 × 1000 × 9987
       × 10 × 100 × 1000 × 9988 × 10 × 100 × 1000 × 9982
S3C    10 × 100 × 1000 × 10000 × 10 × 100 × 1000 × 10000
       × 10 × 100 × 1000 × 10000 × 10 × 100 × 1000 × 10000



All experiments were carried out on a Linux-based (Ubuntu 7.04) dual-processor machine with Intel 
Xeon (single core) 2.8 GHz processors with 2 GiB RAM. It had one disk, a Seagate Cheetah ST373453LC 
(SCSI 320, 15kRPM, 68 GiB), formatted to the ext3 filesystem. Our implementation was done with Sun's
SDK 1.6.0; to handle the large hash tables generated when processing Netflix, we set the maximum heap
size for the JVM to 2 GiB.



6.2 Iterations to Convergence 

Algorithm 1 required 19 iterations and an average of 35 minutes to compute the 1004-carat κ-diamond for
NF1. However, it took 50 iterations and an average of 60 minutes to determine that there was no 1005-carat
diamond. The preprocessing time for NF1 was 22 minutes. For comparison, sorting the Netflix comma-separated
data file took 29 minutes. Times were averaged over 10 runs. Fig. 5 shows the number of cells
present in the diamond after each iteration for 1004-1006 carats. The curve for 1006 reaches zero first,
followed by that for 1005. Since κ(NF1) = 1004, that curve stabilizes at a nonzero value. We see a similar
result for TW2 in Fig. 6, where κ is 37. It takes longer to reach a critical point when k only slightly exceeds κ.

Figure 5: Cells remaining after each iteration of Algorithm 1 on NF1, computing 1004-, 1005- and 1006-carat
diamonds.

Figure 6: Cells remaining after each iteration, TW2.

As stated in Section 5, the number of iterations required until convergence for all our real and synthetic
cubes was far fewer than the upper bound, e.g. cube S2B: 22,195 (upper bound) and 12 (actual). We had
expected to see the uniformly distributed data taking longer to converge than the skewed data. This was not
the case. It may be that a clearer difference would be apparent on larger synthetic data sets. This will be
investigated in future experiments.

6.3 Largest Carats 

According to Proposition 6, COUNT-κ(NF1) ≥ 197. Experimentally, we determined that it was 1004. By
the definition of the carat, it means we can extract a subset of the Netflix data set where each user entered
at least 1004 ratings on movies rated at least 1004 times by these same users during days where there were
at least 1004 ratings by these same users on these same movies. The 1004-carat diamond had dimensions
3082 × 6833 × 1351 and 8,654,370 cells, for a density of about 3 × 10^-4, or two orders of magnitude denser
than the original cube. The presence of such a large diamond was surprising to us. We believe nothing
similar has been observed about the Netflix data set before [5].
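
The density figures can be checked with a few lines of arithmetic. The sketch below takes the density of a cube to be the number of allocated cells divided by the product of the dimension cardinalities, and uses only the numbers quoted in this section and in Section 6.1; the helper name `density` is ours.

```python
from math import prod


def density(cell_count, cardinalities):
    """Fraction of the cube's possible cells that are allocated."""
    return cell_count / prod(cardinalities)


# 1004-carat diamond of NF1 and the original Netflix cube (MovieID x UserID x Date).
diamond_density = density(8_654_370, (3082, 6833, 1351))
cube_density = density(100_478_158, (17766, 480189, 2182))

print(f"diamond: {diamond_density:.2e}")
print(f"cube:    {cube_density:.2e}")
print(f"ratio:   {diamond_density / cube_density:.0f}")
```

With these cardinalities the ratio comes out between 50 and 60, which on a logarithmic scale is closer to two orders of magnitude than to one.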

Comparing the two methods in Section 5.1, we see that sequential search would try 809 values of k
before identifying κ. However, binary search would try 14 values of k (although 3 are between 1005 and
1010, where perhaps double or triple the normal number of iterations are required). To test the time difference
for the two methods, we used cube TW1. We executed a binary search, repeatedly doubling our lower bound
to obtain the upper limit, until we established the range where κ must exist. Whenever we exceeded
κ, a copy of the original data was used for the next step. Even with this copying step and the unnecessary
recomputation from the original data, the time for binary search averaged only 2.75 seconds, whereas a
sequential search that started with the lower bound and increased k by one averaged 9.854 seconds over ten
runs.

Figure 7: Comparison between estimated κ, based on the lower bounds from Proposition 6, and number of
(COUNT-based) carats found.

Fig. 7 shows our lower bounds on κ, given the dimensions and numbers of allocated cells in each cube,
compared with their actual κ values. The plot indicates that our lower bounds are further away from actual
values as the skew of the cube increases for the synthetic cubes. Also, we are further away from κ for TW2,
a cube with 15 dimensions, than for TW1. For uniformly distributed cubes S1C, S2C and S3C there was
no real difference in density between the cube and its diamond. However, all other diamonds experienced an
increase of between 5 and 9 orders of magnitude.

Diamonds found in C1, C2, NF2 and TW3 captured 0.35%, 0.09%, 66.8% and 0.6% of the overall sum
for each cube respectively. The very small fraction captured by the diamond for TW3 can be explained by
the fact that κ(TW3) is based on a diamond that has only one cell, a bombing in Bologna in 1980 that killed
85 people. Similarly, the diamond for C2 also comprised a single cell.



6.4 Effectiveness of DCLD Heuristic 



To test the effectiveness of our diamond-based DCLD heuristic (Section 5.2), we used cube TW1 and
set the parameter p to 5. We were able to establish quickly that the 38-carat diamond was the closest to
satisfying this constraint. It had a density of 0.169 and cardinalities of 15 × 7 × 5 × 8 for the attribute values
year, country, action and target. The solution we generated to this DCLD (p = 5) problem had exactly
5 attribute values per dimension and a density of 0.286.

Since the DCLD problem is NP-complete, determining the quality of the heuristic poses difficulties. We
are not aware of any known approximation algorithms and it seems difficult to formulate a suitably fast
exact solution by, for instance, branch and bound. Therefore, we also implemented a second computationally
expensive heuristic, in hope of finding a high-quality solution with which to compare our diamond-based
heuristic. This heuristic is based on local search from an intuitively reasonable starting state. (A greedy
steepest-descent approach is used; states are $(A_1, A_2, \ldots, A_d)$, where $|A_i| = p_i$, and the local
neighbourhood of such a state is $(A'_1, A'_2, \ldots, A'_d)$, where $A_i = A'_i$ except for one value of i, where $|A_i \cap A'_i| = p_i - 1$.
The starting state consists of the most frequent p_i values from each dimension i. Our implementation actually
requires the i-th local move be chosen along dimension i mod d, although if no such move brings
improvement, no move is made.)

input: d-dimensional cube C, integers p_1, p_2, ..., p_d
output: Cube with size p_1 × p_2 × ... × p_d
Δ ← C
foreach dimension i do
    Sort slices of dimension i of Δ by their σ values
    Retain only the top p_i slices and discard the remainder from Δ
end
repeat
    for i ← 1 to d do
        // We find the best swap in dimension i
        bestAlternative ← σ(Δ)
        foreach value v of dimension i that has been retained in Δ do
            foreach value w from dimension i in C, but where w is not in Δ do
                Form Δ' by temporarily adding slice w and removing slice v from Δ
                if σ(Δ') > bestAlternative then
                    (rem, add) ← (v, w); bestAlternative ← σ(Δ')
                end
            end
        end
        if bestAlternative > σ(Δ) then
            Modify Δ by removing slice rem and adding slice add
        end
    end
until Δ was not modified by any i
return Δ

Algorithm 3: Expensive DCLD heuristic.
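
A compact in-memory sketch of this local search for σ = COUNT is given below. It is an illustration under our own assumptions, not the implementation used in the experiments: the names `count_in` and `local_search` are hypothetical, the cube is a set of d-tuples, the starting state takes the most frequent values of each dimension over the whole cube (as described in the text), and every candidate swap is scored by rescanning all cells.

```python
from collections import Counter


def count_in(cells, kept):
    """COUNT of the subcube: cells whose every coordinate is a retained value."""
    return sum(all(c[i] in kept[i] for i in range(len(kept))) for c in cells)


def local_search(cells, shape):
    """Greedy swap-based search for a dense subcube with shape[i] values per dimension."""
    d = len(shape)
    values = [{c[i] for c in cells} for i in range(d)]
    # Starting state: the shape[i] most frequent values of each dimension.
    kept = []
    for i in range(d):
        counts = Counter(c[i] for c in cells)
        kept.append({v for v, _ in counts.most_common(shape[i])})
    modified = True
    while modified:
        modified = False
        for i in range(d):
            best, swap = count_in(cells, kept), None
            for v in kept[i]:
                for w in values[i] - kept[i]:
                    # Temporarily swap v out and w in along dimension i.
                    trial = kept[:i] + [(kept[i] - {v}) | {w}] + kept[i + 1:]
                    score = count_in(cells, trial)
                    if score > best:
                        best, swap = score, (v, w)
            if swap is not None:
                kept[i] = (kept[i] - {swap[0]}) | {swap[1]}
                modified = True
    return kept, count_in(cells, kept)


if __name__ == "__main__":
    # A dense 2 x 2 block plus one long, sparse row that misleads the starting state.
    cube = {(2, 2), (2, 3), (3, 2), (3, 3)} | {(0, c) for c in range(4, 9)}
    print(local_search(cube, (2, 2)))
```

Since every accepted swap strictly increases the count, the loop terminates; the cost of rescanning the cube for each candidate swap is what makes this heuristic expensive.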

The density reported by Algorithm 3 was 0.283, a similar outcome, but at the expense of more work.
Our diamond-based heuristic, starting with the 38-carat diamond, required a total of 15 deletes, whereas our
expensive comparison heuristic, starting with its 5 × 5 × 5 × 5 subcube, required 1420 inserts/deletes. Our
diamond heuristic might indeed be a useful starting point for a solution to the DCLD problem.

6.5 Robustness against randomly missing data 

We experimented with cube TW1 to determine whether diamond dicing appears robust against random
noise that models the data warehouse problem [31] of missing data. Existing data points had an independent
probability p_missing of being omitted from the data set, and we show p_missing versus κ(TW1) for 30 tests each
with p_missing values between 1% and 5%. Results are shown in Table 7. Our answers were rarely more
than 8% different, even with 5% missing data.
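
The structure of this experiment can be sketched as a small harness. This is a hedged illustration: `drop_cells` and `kappa_histogram` are our own names, and `kappa` stands for any function that returns the largest number of carats of a cube given as a set of cells; it is passed in as a parameter rather than implemented here.

```python
import random
from collections import Counter


def drop_cells(cells, p_missing, rng):
    """Omit each cell independently with probability p_missing."""
    return {c for c in cells if rng.random() >= p_missing}


def kappa_histogram(cells, p_missing, kappa, trials=30, seed=0):
    """Histogram of kappa over `trials` random deletions at one probability."""
    rng = random.Random(seed)
    return Counter(kappa(drop_cells(cells, p_missing, rng)) for _ in range(trials))


if __name__ == "__main__":
    cube = {(r, c) for r in range(20) for c in range(20)}
    # Placeholder only: len is NOT a carat computation, it just makes the demo run.
    for p in (0.01, 0.02, 0.03, 0.04, 0.05):
        print(p, sorted(kappa_histogram(cube, p, kappa=len).items()))
```

Each call produces one column of a table such as Table 7: the observed values of κ and how often each occurred in the 30 trials.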
Table 7: Robustness of κ(TW1) under various amounts of randomly missing data: for each probability,
30 trials were made. Each column is a histogram of the observed values of κ(TW1).

[Table body not recoverable. Rows: κ(TW1); columns: probability of a cell's deallocation.]

7 Conclusion and Future Work

We introduced the diamond dice, a new OLAP operator that dices on all dimensions simultaneously. This 
new operation represents a multidimensional generalization of the iceberg query and can be used by analysts 
to discover sets of attribute values jointly satisfying multidimensional constraints. 

We have shown that the problem is tractable. We were able to process the 2 GiB Netflix data with
500,000 distinct attribute values and 100 million cells in about 35 minutes, excluding preprocessing. As
expected from the theory, real-world data sets converge quickly using Algorithm 1: the first few
iterations quickly prune most of the false candidates. We have identified potential strategies to improve
the performance further. First, we might selectively materialize elements of the diamond-cube lattice (see
Proposition 2). The computation of selected components of the diamond-cube lattice also opens up several
optimization opportunities. Second, we believe we can use ideas from the implementation of ITERATIVE
PRUNING proposed by Kumar et al. [18]. Third, Algorithm 1 is suitable for parallelization [11]. Also,
our current implementation uses only Java's standard libraries and treats all attribute values as strings. We
believe optimizations can be made in the preprocessing step that will greatly reduce overall running time.

We presented theoretical and empirical evidence that a non-trivial, single, dense chunk can be discovered
using the diamond dice and that it provides a sensible heuristic for solving the DENSEST CUBE WITH LIMITED
DIMENSIONS problem. The diamonds are typically much denser than the original cube. Over moderate cubes,
we saw an increase of the density by one order of magnitude, whereas for a large cube (Netflix) we saw
an increase by two orders of magnitude, and more dramatic increases for the synthetic cubes. Even though
Lemma 2 states that diamonds do not necessarily have optimal density given their shape, informal experiments
suggest that they do with high probability. This may indicate that we can bound the sub-optimality, at
least in the average case; further study is needed.

We have shown that sum-based diamonds are no harder to compute than count-based diamonds and we
plan to continue working towards an efficient solution for the Heaviest Cube with Limited Dimensions
(HCLD).